
Introduction

The purpose of this project is to gauge your technical skills and problem-solving ability by working through something similar to a real NBA data science project. You will work your way through this R Markdown document, answering questions as you go along. Please begin by adding your name to the “author” key in the YAML header. When you’re finished with the document, come back and type your answers into the answer key at the top. Please leave all your work below and have your answers where indicated below as well. Please note that we will be reviewing your code, so make it clear and concise, and avoid long printouts. Feel free to add in as many new code chunks as you’d like.

Remember that we will be grading the quality of your code and visuals alongside the correctness of your answers. Please try to use the tidyverse as much as possible (instead of base R and explicit loops).

Note:

Throughout this document, any season column represents the year each season started. For example, the 2015-16 season will be in the dataset as 2015. For most of the rest of the project, we will refer to a season by just this number (e.g. 2015) instead of the full text (e.g. 2015-16).

Answers

Part 1

Question 1:

  • 1st Team: XX.X points per game
  • 2nd Team: XX.X points per game
  • 3rd Team: XX.X points per game
  • All-Star: XX.X points per game

Question 2: XX.X Years

Question 3:

  • Elite: X players.
  • All-Star: X players.
  • Starter: X players.
  • Rotation: X players.
  • Roster: X players.
  • Out of League: X players.

Open Ended Modeling Question: Please show your work and leave all responses below in the document.

Part 2

Question 1: XX.X%
Question 2: Written question, put answer below in the document.
Question 3: Written question, put answer below in the document.

Setup and Data

library(tidyverse)
# Note, you will likely have to change these paths. If your data is in the same folder as this project, 
# the paths will likely be fixed for you by deleting ../../Data/awards_project/ from each string.
awards <- read_csv("/Users/zhenyan/Downloads/awards_data.csv")
player_data <- read_csv("/Users/zhenyan/Downloads/player_stats.csv")
team_data <- read_csv("/Users/zhenyan/Downloads/team_stats.csv")
rebounding_data <- read_csv("/Users/zhenyan/Downloads/team_rebounding_data_22.csv")

Part 1 – Awards

In this section, you’re going to work with data relating to player awards and statistics. You’ll start with some data manipulation questions and work towards building a model to predict broad levels of career success.

Question 1

QUESTION: What is the average number of points per game for players in the 2007-2021 seasons who won All NBA First, Second, and Third Teams (not the All Defensive Teams), as well as for players who were in the All-Star Game (not the rookie all-star game)?

# Here and for all future questions, feel free to add as many code chunks as you like. Do NOT put echo = F though, we'll want to see your code.
library(dplyr)

# Merge player_data and awards
merged_data <- player_data %>%
  left_join(awards, by = c("season", "nbapersonid"), relationship = "many-to-many")

# Average points per game over a filtered set of player-seasons
avg_ppg <- function(df) {
  df %>%
    summarise(avg = mean(points / games, na.rm = TRUE)) %>%
    pull(avg)
}

seasons_07_21 <- merged_data %>% filter(season >= 2007, season <= 2021)

first_team_avg   <- seasons_07_21 %>% filter(`All NBA First Team` == 1) %>% avg_ppg()
second_team_avg  <- seasons_07_21 %>% filter(`All NBA Second Team` == 1) %>% avg_ppg()
third_team_avg   <- seasons_07_21 %>% filter(`All NBA Third Team` == 1) %>% avg_ppg()
allstar_team_avg <- seasons_07_21 %>% filter(all_star_game == TRUE) %>% avg_ppg()
round(first_team_avg,1)
## [1] 25.9
round(second_team_avg,1)
## [1] 23.1
round(third_team_avg,1)
## [1] 20.5
round(allstar_team_avg,1)
## [1] 21.6

ANSWER 1:

1st Team: 25.9 points per game
2nd Team: 23.1 points per game
3rd Team: 20.5 points per game
All-Star: 21.6 points per game

Question 2

QUESTION: What was the average number of years of experience in the league it takes for players to make their first All NBA Selection (1st, 2nd, or 3rd team)? Please limit your sample to players drafted in 2007 or later who did eventually go on to win at least one All NBA selection. For example:

  • Luka Doncic is in the dataset as 2 years. He was drafted in 2018 and won his first All NBA award in 2019 (which was his second season).
  • LeBron James is not in this dataset, as he was drafted prior to 2007.
  • Lu Dort is not in this dataset, as he has not received any All NBA honors.
library(dplyr)

# Filter players drafted in 2007 or later
drafted_after_2006 <- player_data %>%
  filter(draftyear >= 2007)

# Identify players with at least one All NBA selection
all_nba_players <- awards %>%
  filter(
    (`All NBA First Team` == 1 | `All NBA Second Team` == 1 | `All NBA Third Team` == 1) &
    nbapersonid %in% drafted_after_2006$nbapersonid
  )

# Calculate difference between draft year and the first All NBA award
years_to_first_award <- all_nba_players %>%
  group_by(nbapersonid) %>%
  summarise(first_award_year = min(season)) %>% # first award season
  left_join(distinct(drafted_after_2006, nbapersonid, draftyear),
            by = "nbapersonid") %>% # one row per player, so each player is weighted once
  mutate(years_to_first = first_award_year - draftyear)

# Compute average
avg_years_to_first_award <- mean(years_to_first_award$years_to_first, na.rm = TRUE)

print(round(avg_years_to_first_award,1))
## [1] 3.8

ANSWER 2:

3.8 Years

Data Cleaning Interlude

You’re going to work to create a dataset with a “career outcome” for each player, representing the highest level of success that the player achieved for at least two seasons after his first four seasons in the league (examples to follow below!). To do this, you’ll start with single season level outcomes. On a single season level, the outcomes are:

  • Elite: A player is “Elite” in a season if he won any All NBA award (1st, 2nd, or 3rd team), MVP, or DPOY in that season.
  • All-Star: A player is “All-Star” in a season if he was selected to be an All-Star that season.
  • Starter: A player is a “Starter” in a season if he started in at least 41 games in the season OR if he played at least 2000 minutes in the season.
  • Rotation: A player is a “Rotation” player in a season if he played at least 1000 minutes in the season.
  • Roster: A player is a “Roster” player in a season if he played at least 1 minute for an NBA team but did not meet any of the above criteria.
  • Out of the League: A player is “Out of the League” if he is not in the NBA in that season.

We need to make an adjustment for determining Starter/Rotation qualifications for a few seasons that didn’t have 82 games per team. Assume that there were 66 possible games in the 2011 lockout season and 72 possible games in each of the 2019 and 2020 seasons that were shortened due to covid. Specifically, if a player played 900 minutes in 2011, he would meet the rotation criteria because his final minutes would be considered to be 900 * (82/66) = 1118. Please use this math for both minutes and games started, so a player who started 38 games in 2019 or 2020 would be considered to have started 38 * (82/72) = 43 games, and thus would qualify for starting 41. Any answers should be calculated assuming you round the multiplied values to the nearest whole number.
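As a quick sanity check of this proration, the arithmetic above can be sketched as a small standalone helper (illustrative only; `prorate_to_82` is a hypothetical name, using the same constants as the examples):

```r
library(dplyr)

# Prorate a raw season total (minutes or games started) to an 82-game basis
prorate_to_82 <- function(x, season) {
  games_possible <- case_when(
    season == 2011            ~ 66,  # lockout season
    season %in% c(2019, 2020) ~ 72,  # covid-shortened seasons
    TRUE                      ~ 82
  )
  round(x * 82 / games_possible)
}

prorate_to_82(900, 2011)  # 1118 minutes -> meets the Rotation cutoff
prorate_to_82(38, 2019)   # 43 games started -> meets the Starter cutoff
```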

Note that on a season level, a player’s outcome is the highest level of success he qualifies for in that season. Thus, since Shai Gilgeous-Alexander was both All-NBA 1st team and an All-Star last year, he would be considered to be “Elite” for the 2022 season, but would still qualify for a career outcome of All-Star if in the rest of his career he made one more All-Star game but no more All-NBA teams. Note this is a hypothetical, and Shai has not yet played enough to have a career outcome.

Examples:

  • A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Rotation (3), Roster (4), Roster (5), Out of the League (6+) would be considered “Out of the League,” because after his first four seasons, he only has a single Roster year, which does not qualify him for any success outcome.
  • A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Starter (3), Starter (4), Starter (5), Starter (6), All-Star (7), Elite (8), Starter (9) would be considered “All-Star,” because he had at least two seasons after his first four at all-star level of production or higher.
  • A player who enters the league as a rookie and has season outcomes of Roster (1), Rotation (2), Starter (3), Starter (4), Starter (5), Starter (6), Rotation (7), Rotation (8), Roster (9) would be considered a “Starter” because he has two seasons after his first four at a starter level of production.
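The “at least two seasons at a level or higher” rule in these examples can be sketched as a small helper (an illustrative function, not the pipeline used below; it takes only the season outcomes after a player’s first four years):

```r
outcome_levels <- c("Out of the League", "Roster", "Rotation",
                    "Starter", "All-Star", "Elite")

# Highest level with at least two seasons at that level or better;
# defaults to "Out of the League" if no level qualifies
career_outcome_of <- function(outcomes_after_year4) {
  ranks <- match(outcomes_after_year4, outcome_levels)
  qualifies <- vapply(seq_along(outcome_levels),
                      function(r) sum(ranks >= r) >= 2, logical(1))
  outcome_levels[max(c(1, which(qualifies)))]
}

career_outcome_of(c("Roster"))                                            # "Out of the League"
career_outcome_of(c("Starter", "Starter", "All-Star", "Elite", "Starter")) # "All-Star"
career_outcome_of(c("Starter", "Starter", "Rotation", "Rotation", "Roster")) # "Starter"
```

The three calls reproduce the three worked examples above.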

Question 3

QUESTION: There are 73 players in the player_data dataset who have 2010 listed as their draft year. How many of those players have a career outcome in each of the 6 buckets?

# Adjustment: prorate games started and minutes to an 82-game schedule for the
# 2011 lockout season (66 games) and the 2019/2020 covid seasons (72 games)
season_adjustment <- function(season) {
  case_when(
    season == 2011 ~ 82 / 66,
    season %in% c(2019, 2020) ~ 82 / 72,
    TRUE ~ 1
  )
}

merged_data <- merged_data %>%
  mutate(
    adjusted_games_started  = round(games_start * season_adjustment(season)),
    adjusted_minutes_played = round(mins * season_adjustment(season))
  )

#Players with 2010 as draft year
drafted_2010 <- merged_data %>% filter(draftyear == 2010)

#Assign single season outcomes
drafted_2010 <- drafted_2010 %>%
  mutate(season_outcome = case_when(
    `All NBA First Team` == 1 | `All NBA Second Team` == 1 | `All NBA Third Team` == 1 ~ "Elite",
    all_star_game == TRUE ~ "All-Star",
    adjusted_games_started >= 41 | adjusted_minutes_played >= 2000 ~ "Starter",
    adjusted_minutes_played >= 1000 ~ "Rotation",
    adjusted_minutes_played >= 1 ~ "Roster",
    TRUE ~ "Out of the League"
  ))
#Assign career outcomes
career_outcomes <- drafted_2010 %>%
  group_by(nbapersonid) %>%
  slice(5:n()) %>%
  count(season_outcome) %>%
  top_n(n = 1, wt = n) %>% 
  arrange(-n) %>%
  summarize(career_outcome = case_when(
    any(season_outcome == "Elite") ~ "Elite",
    any(season_outcome == "All-Star") ~ "All-Star",
    any(season_outcome == "Starter") ~ "Starter",
    any(season_outcome == "Rotation") ~ "Rotation",
    any(season_outcome == "Roster") ~ "Roster",
    TRUE ~ "Out of the League"
  ))

#Count number of players of each bucket
bucket_counts <- career_outcomes %>%
  count(career_outcome)
bucket_counts
## # A tibble: 5 × 2
##   career_outcome     n
##   <chr>          <int>
## 1 All-Star           1
## 2 Elite              1
## 3 Roster            61
## 4 Rotation           2
## 5 Starter            8

ANSWER 3:

Elite: 1 player.
All-Star: 1 player.
Starter: 8 players.
Rotation: 2 players.
Roster: 61 players.
Out of League: 0 players.

Open Ended Modeling Question

In this question, you will work to build a model to predict a player’s career outcome based on information up through the first four years of his career.

This question is intentionally left fairly open ended, but here are some notes and specifications.

  1. We know modeling questions can take a long time, and that qualified candidates will have different levels of experience with “formal” modeling. Don’t be discouraged. It’s not our intention to make you spend excessive time here. If you get your model to a good spot but think you could do better by spending a lot more time, you can just write a bit about your ideas for future improvement and leave it there. Further, we’re more interested in your thought process and critical thinking than we are in specific modeling techniques. Using smart features is more important than using fancy mathematical machinery, and a successful candidate could use a simple regression approach.

  2. You may use any data provided in this project, but please do not bring in any external sources of data. Note that while most of the data provided goes back to 2007, All NBA and All Rookie team voting is only included back to 2011.

  3. A player needs to complete at least three additional seasons after their first four to be considered as having a distinct career outcome for our dataset. (We are using 3+ instead of 2+ just to give each player a little more time to accumulate high level seasons before we classify his career). Because the dataset in this project ends in 2021, this means that a player would need to have had the chance to play in the ’21, ’20, and ’19 seasons after his first four years, and thus his first four years would have been ’18, ’17, ’16, and ’15. For this reason, limit your training data to players who were drafted in or before the 2015 season. Karl-Anthony Towns was the #1 pick in that season.

  4. Once you build your model, predict on all players who were drafted in 2018-2021 (They have between 1 and 4 seasons of data available and have not yet started accumulating seasons that inform their career outcome).

  5. You can predict a single career outcome for each player, but it’s better if you can predict the probability that each player falls into each outcome bucket.

  6. Include, as part of your answer:

  • A brief written overview of how your model works, targeted towards a decision maker in the front office without a strong statistical background.
  • What you view as the strengths and weaknesses of your model.
  • How you’d address the weaknesses if you had more time and or more data.
  • A ggplot or ggplotly visualization highlighting some part of your modeling process, the model itself, or your results.
  • Your predictions for Shai Gilgeous-Alexander, Zion Williamson, James Wiseman, and Josh Giddey.
  • (Bonus!) An html table (for example, see the package reactable) containing all predictions for the players drafted in 2019-2021.
library(dplyr)
library(tidyr)
library(ggplot2)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Data Preprocessing

library(tidyverse)

# Preprocess merged_data
merged_data <- merged_data %>%
  left_join(career_outcomes, by = "nbapersonid") %>%
  replace_na(list(career_outcome = "Out of the League"))

# For training set
data_train <- merged_data %>%
  filter(draftyear <= 2015) %>%
  group_by(nbapersonid) %>%
  slice(1:4)

# Assign single season outcomes
data_train <- data_train %>%
  mutate(season_outcome = case_when(
    `All NBA First Team` == 1 | `All NBA Second Team` == 1 | `All NBA Third Team` == 1 ~ "Elite",
    all_star_game == TRUE ~ "All-Star",
    adjusted_games_started >= 41 | adjusted_minutes_played >= 2000 ~ "Starter",
    adjusted_minutes_played >= 1000 ~ "Rotation",
    adjusted_minutes_played >= 1 ~ "Roster",
    TRUE ~ "Out of the League"
  ))

# Compute career_outcomes for each nbapersonid
career_outcomes <- data_train %>%
  group_by(nbapersonid) %>%
  slice(5:n()) %>%
  count(season_outcome) %>%
  top_n(n = 1, wt = n) %>%
  summarize(career_outcome = case_when(
    any(season_outcome == "Elite") ~ "Elite",
    any(season_outcome == "All-Star") ~ "All-Star",
    any(season_outcome == "Starter") ~ "Starter",
    any(season_outcome == "Rotation") ~ "Rotation",
    any(season_outcome == "Roster") ~ "Roster",
    TRUE ~ "Out of the League"
  )) %>%
  ungroup()

# Handle NA values in player_data
player_data[is.na(player_data)] <- 0

numeric_columns_player_data <- sapply(player_data, is.numeric)
avg_player_data_numeric <- player_data %>%
  filter(draftyear <= 2015) %>%
  select(which(numeric_columns_player_data)) %>%
  group_by(nbapersonid, season) %>%
  summarise(across(everything(), \(x) mean(x, na.rm = TRUE))) %>%
  ungroup()

player_aggregated_data <- avg_player_data_numeric %>%
  group_by(nbapersonid) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) %>%
  ungroup()
# Join with career_outcomes to get training_set
training_set <- career_outcomes %>%
  left_join(player_aggregated_data, by = "nbapersonid")

head(training_set)
## # A tibble: 6 × 48
##   nbapersonid career_outcome season draftyear draftpick  nbateamid games
##         <dbl> <chr>           <dbl>     <dbl>     <dbl>      <dbl> <dbl>
## 1          15 Roster          2007       1994        15 1610612756  16  
## 2          87 Roster          2008.      1991         4 1610612745  24  
## 3         109 Roster          2007       1992        11 1610612759  45  
## 4         136 Roster          2007       1992        29 1610612738  18  
## 5         185 Roster          2007       1993         1 1610612744   9  
## 6         208 Roster          2007       1993        24 1610612742  27.5
## # ℹ 41 more variables: games_start <dbl>, mins <dbl>, fgm <dbl>, fga <dbl>,
## #   fgp <dbl>, fgm3 <dbl>, fga3 <dbl>, fgp3 <dbl>, fgm2 <dbl>, fga2 <dbl>,
## #   fgp2 <dbl>, efg <dbl>, ftm <dbl>, fta <dbl>, ftp <dbl>, off_reb <dbl>,
## #   def_reb <dbl>, tot_reb <dbl>, ast <dbl>, steals <dbl>, blocks <dbl>,
## #   tov <dbl>, tot_fouls <dbl>, points <dbl>, PER <dbl>, FTr <dbl>,
## #   off_reb_pct <dbl>, def_reb_pct <dbl>, tot_reb_pct <dbl>, ast_pct <dbl>,
## #   stl_pct <dbl>, blk_pct <dbl>, tov_pct <dbl>, usg <dbl>, OWS <dbl>, …
# Handle NA values for numeric columns in merged_data
numeric_cols <- sapply(merged_data, is.numeric)
merged_data[numeric_cols] <- lapply(merged_data[numeric_cols], function(x) ifelse(is.na(x), 0, x))

# Handle NA values for character or factor columns in merged_data
char_or_factor_cols <- sapply(merged_data, function(x) is.character(x) | is.factor(x))
merged_data[char_or_factor_cols] <- lapply(merged_data[char_or_factor_cols], function(x) ifelse(is.na(x), "Unknown", x))

# Preprocess test set
merged_data <- merged_data %>%
  left_join(career_outcomes, by = "nbapersonid") %>%
  replace_na(list(career_outcome = "Out of the League"))

data_test <- merged_data %>%
  filter(draftyear >= 2018 & draftyear <= 2021) %>%
  group_by(nbapersonid) %>%
  slice(1:4)

# Assign single season outcomes
data_test <- data_test %>%
  mutate(season_outcome = case_when(
    `All NBA First Team` == 1 | `All NBA Second Team` == 1 | `All NBA Third Team` == 1 ~ "Elite",
    all_star_game == TRUE ~ "All-Star",
    adjusted_games_started >= 41 | adjusted_minutes_played >= 2000 ~ "Starter",
    adjusted_minutes_played >= 1000 ~ "Rotation",
    adjusted_minutes_played >= 1 ~ "Roster",
    TRUE ~ "Out of the League"
  ))

# Compute career_outcomes for each nbapersonid
career_outcomes <- data_test %>%
  group_by(nbapersonid) %>%
  slice(5:n()) %>%
  count(season_outcome) %>%
  top_n(n = 1, wt = n) %>%
  summarize(career_outcome = case_when(
    any(season_outcome == "Elite") ~ "Elite",
    any(season_outcome == "All-Star") ~ "All-Star",
    any(season_outcome == "Starter") ~ "Starter",
    any(season_outcome == "Rotation") ~ "Rotation",
    any(season_outcome == "Roster") ~ "Roster",
    TRUE ~ "Out of the League"
  )) %>%
  ungroup()

# Handle NA values in player_data
player_data[is.na(player_data)] <- 0

numeric_columns_player_data <- sapply(player_data, is.numeric)
avg_player_data_numeric <- player_data %>%
  filter(draftyear >= 2018 & draftyear <= 2021) %>%
  select(which(numeric_columns_player_data)) %>%
  group_by(nbapersonid, season) %>%
  summarise(across(everything(), \(x) mean(x, na.rm = TRUE))) %>%
  ungroup()

player_aggregated_data <- avg_player_data_numeric %>%
  group_by(nbapersonid) %>%
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) %>%
  ungroup()

# Join with career_outcomes to get test_set
test_set <- career_outcomes %>%
  left_join(player_aggregated_data, by = "nbapersonid")

player_names <- player_data %>%
  select(nbapersonid, player)

test_set <- test_set %>%
  left_join(player_names, by = "nbapersonid")
head(test_set)
## # A tibble: 6 × 49
##   nbapersonid career_outcome season draftyear draftpick   nbateamid games
##         <dbl> <chr>           <dbl>     <dbl>     <dbl>       <dbl> <dbl>
## 1     1628238 Roster          2021       2018         0 1610612756      2
## 2     1628959 Roster          2018       2018         0 1610612741     10
## 3     1628960 Starter         2020.      2018        21 1610612759.    48
## 4     1628960 Starter         2020.      2018        21 1610612759.    48
## 5     1628960 Starter         2020.      2018        21 1610612759.    48
## 6     1628960 Starter         2020.      2018        21 1610612759.    48
## # ℹ 42 more variables: games_start <dbl>, mins <dbl>, fgm <dbl>, fga <dbl>,
## #   fgp <dbl>, fgm3 <dbl>, fga3 <dbl>, fgp3 <dbl>, fgm2 <dbl>, fga2 <dbl>,
## #   fgp2 <dbl>, efg <dbl>, ftm <dbl>, fta <dbl>, ftp <dbl>, off_reb <dbl>,
## #   def_reb <dbl>, tot_reb <dbl>, ast <dbl>, steals <dbl>, blocks <dbl>,
## #   tov <dbl>, tot_fouls <dbl>, points <dbl>, PER <dbl>, FTr <dbl>,
## #   off_reb_pct <dbl>, def_reb_pct <dbl>, tot_reb_pct <dbl>, ast_pct <dbl>,
## #   stl_pct <dbl>, blk_pct <dbl>, tov_pct <dbl>, usg <dbl>, OWS <dbl>, …

Feature Engineering

# For the training set
training_set <- training_set %>%
  mutate(
    # 1. Shooting Efficiency
    shooting_efficiency = ifelse(fga != 0, fgm / fga, 0),
    
    # 2. Three-point Shooting Efficiency
    three_point_efficiency = ifelse(fga3 != 0, fgm3 / fga3, 0),
    
    # 3. Free Throw Efficiency
    free_throw_efficiency = ifelse(fta != 0, ftm / fta, 0),
    
    # 4. Points per Minute
    points_per_minute = ifelse(mins != 0, points / mins, 0),
    
    # 5. Points per Game
    points_per_game = ifelse(games != 0, points / games, 0),
    
    # 6. Minutes per Game
    mins_per_game = ifelse(games != 0, mins / games, 0),
    
    # 7. Year-on-Year Improvement
    ppg = points_per_game,
    yoy_improvement = ppg - lag(ppg, 1)
  ) %>%
  ungroup()

# For the test set
test_set <- test_set %>%
  mutate(
    # 1. Shooting Efficiency
    shooting_efficiency = ifelse(fga != 0, fgm / fga, 0),
    
    # 2. Three-point Shooting Efficiency
    three_point_efficiency = ifelse(fga3 != 0, fgm3 / fga3, 0),
    
    # 3. Free Throw Efficiency
    free_throw_efficiency = ifelse(fta != 0, ftm / fta, 0),
    
    # 4. Points per Minute
    points_per_minute = ifelse(mins != 0, points / mins, 0),
    
    # 5. Points per Game
    points_per_game = ifelse(games != 0, points / games, 0),
    
    # 6. Minutes per Game
    mins_per_game = ifelse(games != 0, mins / games, 0),
    
    # 7. Year-on-Year Improvement
    ppg = points_per_game,
    yoy_improvement = ppg - lag(ppg, 1)
  ) %>%
  ungroup()
#lasso regression to determine which features are chosen
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-7
# Encode career_outcome using label encoding
training_set$career_outcome <- as.numeric(factor(training_set$career_outcome))

# Split data into features and target
X <- training_set %>% select(-nbapersonid, -career_outcome,-draftyear,-season,-draftpick,-nbateamid)
y <- training_set$career_outcome

# Impute missing values with median for numeric columns
X_imputed <- X
num_cols <- sapply(X, is.numeric)
X_imputed[, num_cols] <- lapply(X_imputed[, num_cols], function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))

# Standardize the imputed features
X_scaled <- scale(X_imputed)

# Fit Lasso model at the lambda chosen by cross-validation
cv_fit <- cv.glmnet(X_scaled, y, alpha = 1)
lasso_model <- glmnet(X_scaled, y, alpha = 1, lambda = cv_fit$lambda.min)

# Extract important features
important_features <- rownames(coef(lasso_model))[which(coef(lasso_model) != 0)]
important_features
##  [1] "(Intercept)"       "games"             "games_start"      
##  [4] "mins"              "fgp"               "ftm"              
##  [7] "ftp"               "ast"               "steals"           
## [10] "tov"               "tot_fouls"         "FTr"              
## [13] "ast_pct"           "OWS"               "DWS"              
## [16] "VORP"              "points_per_minute"

Modeling

set.seed(1)

sample <- sample(c(TRUE, FALSE), nrow(X_scaled), replace=TRUE, prob=c(0.7,0.3))
training_data <- training_set[sample,]
cross_validation <- training_set[!sample,]

#convert to data frame
training_data <- as.data.frame(training_data)
cross_validation <- as.data.frame(cross_validation)
Multinomial Logistic Regression Model
library(nnet)
# 1. Prepare the data with the selected features (informed by the lasso output above, plus additional box-score metrics)
selected_features <- c("games", "games_start", "mins", "fgp2", "efg", "ftm", "ast", "steals", 
                       "tov", "tot_fouls", "PER", "FTr", "ast_pct", "stl_pct", "blk_pct", 
                       "tov_pct", "OWS", "DWS", "VORP", "three_point_efficiency", 
                       "free_throw_efficiency", "points_per_minute", "points_per_game")

X_selected <- training_data[, selected_features]
y <- as.factor(training_data$career_outcome)  # ensure the response variable is a factor

# 2. Fit the multiple logistic regression model
multinom_model <- multinom(y ~ ., data = data.frame(y, X_selected))
## # weights:  125 (96 variable)
## initial  value 1260.189885 
## iter  10 value 773.399946
## iter  20 value 675.757956
## iter  30 value 606.207788
## iter  40 value 554.578863
## iter  50 value 421.473587
## iter  60 value 384.268803
## iter  70 value 380.118241
## iter  80 value 379.096946
## iter  90 value 378.368910
## iter 100 value 377.529767
## final  value 377.529767 
## stopped after 100 iterations
# 3. Summarize the model to interpret the results
#summary(multinom_model)

# 4. Performance on cross validation set
actual_outcomes <- as.numeric(cross_validation$career_outcome)
prediction <- predict(multinom_model,newdata=cross_validation)
accuracy <- sum(prediction == actual_outcomes, na.rm = TRUE)/length(actual_outcomes)

cat("The accuracy is:", accuracy) # 0.7402985
## The accuracy is: 0.7402985
Random Forest
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
#fit the model
data_clean <- na.omit(data.frame(y, X_selected))
rf_model <- randomForest(y ~ ., data = data_clean, ntree = 2000)

#summary
#summary(rf_model)

# 4. Performance on cross validation set
actual_outcomes <- as.numeric(cross_validation$career_outcome)
prediction <- predict(rf_model,newdata=cross_validation)
accuracy <- sum(prediction == actual_outcomes, na.rm = TRUE)/length(actual_outcomes)

cat("The accuracy is:", accuracy) # 0.7492537, better
## The accuracy is: 0.7492537

Predictions on the Test Set

multinom_prediction <- predict(multinom_model,test_set)
rf_prediction <- predict(rf_model,test_set)

#multinom_prediction
#rf_prediction

#with probabilities
multinom_prediction_prob <- predict(multinom_model,test_set,type='prob')
rf_prediction_prob <- predict(rf_model,test_set,type='prob')

#multinom_prediction_prob
#rf_prediction_prob
library(tibble)
# The outcome factor levels were assigned alphabetically when career_outcome
# was label-encoded, so order the display labels the same way
labels <- sort(c("Elite", "All-Star", "Starter", "Rotation", "Roster"))
# Identify and remove duplicate players, retaining only unique player indices
unique_player_indices <- !duplicated(test_set$player)
player_indices <- which(unique_player_indices)

# Filter predictions and probabilities for the specific players
multinom_preds_for_players <- multinom_prediction[player_indices]
rf_preds_for_players <- rf_prediction[player_indices]

multinom_probs_for_players <- multinom_prediction_prob[player_indices, ]
rf_probs_for_players <- rf_prediction_prob[player_indices, ]

# Convert probabilities into a readable format 
multinom_probs_readable <- apply(multinom_probs_for_players, 1, function(row) {
  paste(labels, round(row, 3), sep=": ", collapse=" | ")
})
rf_probs_readable <- apply(rf_probs_for_players, 1, function(row) {
  paste(labels, round(row, 3), sep=": ", collapse=" | ")
})

# Construct the data frame
predictions_df <- tibble(
  Player = test_set$player[player_indices],
  Multinom_Predictions = labels[as.numeric(multinom_preds_for_players)],
  RF_Predictions = labels[as.numeric(rf_preds_for_players)],
  Multinom_Probabilities = multinom_probs_readable,
  RF_Probabilities = rf_probs_readable
)

print(predictions_df)
## # A tibble: 396 × 5
##    Player             Multinom_Predictions RF_Predictions Multinom_Probabilities
##    <chr>              <chr>                <chr>          <chr>                 
##  1 Paris Bass         Starter              Starter        Elite: 0 | All-Star: …
##  2 Rawle Alkins       Starter              Starter        Elite: 0 | All-Star: …
##  3 Grayson Allen      Starter              Starter        Elite: 0 | All-Star: …
##  4 Kostas Antetokoun… Starter              Starter        Elite: 0 | All-Star: …
##  5 Udoka Azubuike     Starter              Starter        Elite: 0 | All-Star: …
##  6 Marvin Bagley      Starter              Starter        Elite: 0 | All-Star: …
##  7 Mohamed Bamba      Starter              Starter        Elite: 0 | All-Star: …
##  8 Keita Bates-Diop   Starter              Starter        Elite: 0 | All-Star: …
##  9 Brian Bowen II     Starter              Starter        Elite: 0 | All-Star: …
## 10 Mikal Bridges      Roster               Roster         Elite: 0 | All-Star: …
## # ℹ 386 more rows
## # ℹ 1 more variable: RF_Probabilities <chr>

1. Overview:

Understanding the future trajectory of an NBA player is a multifaceted challenge. Our model is designed to demystify this process by harnessing data from players’ early careers to forecast their potential.

We have rigorously analyzed historical data, focusing on players who began their careers before 2015. Key performance metrics, including games played, starting appearances, minutes on the court, shooting accuracy, defensive contributions, and several others, have been meticulously incorporated. These parameters have been chosen because they consistently demonstrate a strong correlation with long-term player success.

Utilizing these metrics, our predictive model offers an informed assessment of the career trajectories for players who debuted between 2018 and 2021. In rigorous testing environments, our model demonstrated a commendable accuracy of 75%. While no predictive tool can claim infallibility, we believe our model serves as an invaluable asset, providing robust, data-driven insights to guide front office decisions regarding player potential.

In essence, our approach marries historical insights with advanced analytics to offer a strategic perspective on player potential, assisting your team in making enlightened decisions for the future of your franchise.

2. Strengths and Weaknesses of the Model:

Strengths:

Comprehensive Metrics: Our model evaluates a wide array of on-court metrics, providing a holistic view of a player’s performance.

Historical Context: By examining players who started before 2015, we draw from a robust dataset that covers various playing styles and eras.

High Predictive Accuracy: With a commendable 75% accuracy rate in test environments, our model proves to be both reliable and efficient.

Weaknesses:

Data Limitations: Our dataset only encompasses data from before 2015. As a result, recent shifts in the NBA landscape—such as the emphasis on three-point shooting or evolving defensive schemes—might not be fully captured.

Overlooked Variables: Sports, especially a dynamic game like basketball, have various externalities. Our model may not account for factors such as player injuries, changing team dynamics, coaching styles, or off-court issues that can influence a player’s career trajectory.

3. Improvement Pathways:

To address these shortcomings, our next steps could involve:

Expanding Data Range: Incorporating data from recent years would help capture newer trends and patterns in the NBA, making our predictions more relevant.

Inclusion of External Factors: We recognize the importance of non-statistical elements. Future iterations could include more qualitative data or secondary metrics to evaluate off-court influences, team dynamics, and other intangibles.

More Advanced Models: With more rows of data, we could develop a deep learning model for the classification task.

4. Visualization

library(ggplot2)
library(pROC)
# Multinomial Logistic Regression
probs_multinom <- predict(multinom_model, newdata = cross_validation, type = "probs")

# Random Forest
probs_rf <- predict(rf_model, newdata = cross_validation, type = "prob")

roc_data <- data.frame()

classes <- levels(factor(cross_validation$career_outcome))

# Loop over each class
for(class in classes){
  
  # Convert actual outcomes to a binary format: "PositiveClass" for the current class, "Other" otherwise
  binary_outcome <- ifelse(cross_validation$career_outcome == class, "PositiveClass", "Other")
  
  # Compute the ROC curve for the multinomial model with direction specified
  roc_multinom <- roc(binary_outcome, probs_multinom[, class], levels=c("Other", "PositiveClass"), quiet = TRUE)
  
  # Compute the ROC curve for the random forest model with direction specified
  roc_rf <- roc(binary_outcome, probs_rf[, class], levels=c("Other", "PositiveClass"), quiet=TRUE)
  
  # Append ROC data to the data frame
  roc_data <- rbind(roc_data, 
                    data.frame(Model = "Multinom", Class = class, TPR = roc_multinom$sensitivities, FPR = 1 - roc_multinom$specificities),
                    data.frame(Model = "Random Forest", Class = class, TPR = roc_rf$sensitivities, FPR = 1 - roc_rf$specificities)
  )
}

# Checking the head of roc_data again
head(roc_data)
##      Model Class TPR       FPR
## 1 Multinom     1   1 1.0000000
## 2 Multinom     1   1 0.9969419
## 3 Multinom     1   1 0.9938838
## 4 Multinom     1   1 0.9908257
## 5 Multinom     1   1 0.9877676
## 6 Multinom     1   1 0.9847095
library(plotly)
library(ggplot2)

# Define colors
colors <- c("Multinom" = "#002c53", "Random Forest" = "#ffa510")

plots <- list()

for(class in classes){
  
  binary_outcome <- ifelse(cross_validation$career_outcome == class, "PositiveClass", "Other")
  
  auc_multinom <- auc(roc(binary_outcome, probs_multinom[, class], levels = c("Other", "PositiveClass"), quiet = TRUE))
  auc_rf <- auc(roc(binary_outcome, probs_rf[, class], levels = c("Other", "PositiveClass"), quiet = TRUE))
  
  # Create a subset of roc_data for the specific class
  subset_data <- roc_data[roc_data$Class == class,]
  
  # Create the ggplot object for the specific class
  p <- ggplot(subset_data, aes(x = FPR, y = TPR, color = Model)) + 
    geom_line(linewidth = 1) +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey50") +
    labs(title = paste("ROC Curve for Class", class, "with AUC", "\nMultinom AUC:", round(auc_multinom, 3), "Random Forest AUC:", round(auc_rf, 3)), 
         x = "False Positive Rate", y = "True Positive Rate") +
    theme_minimal() +
    theme(legend.position = "bottom") +
    scale_color_manual(values = colors, name = "Model")
  
  # Convert the ggplot object to a plotly object and append to the plots list
  plots[[class]] <- ggplotly(p)
}
# View each per-class ROC plot
plots[[1]]
plots[[2]]
plots[[3]]
plots[[4]]
plots[[5]]

5. Predictions for specific players

specific_players <- c("Shai Gilgeous-Alexander", "Zion Williamson", "James Wiseman", "Josh Giddey")
# Get unique indices for specific players
unique_player_indices <- which(!duplicated(test_set$player) & test_set$player %in% specific_players)

# Update the predictions and probabilities using the unique indices
multinom_preds_for_players <- multinom_prediction[unique_player_indices]
rf_preds_for_players <- rf_prediction[unique_player_indices]

multinom_probs_for_players <- multinom_prediction_prob[unique_player_indices, ]
rf_probs_for_players <- rf_prediction_prob[unique_player_indices, ]

# Convert probabilities into a readable format 
multinom_probs_readable <- apply(multinom_probs_for_players, 1, function(row) {
  paste(labels, round(row, 3), sep=": ", collapse=" | ")
})
rf_probs_readable <- apply(rf_probs_for_players, 1, function(row) {
  paste(labels, round(row, 3), sep=": ", collapse=" | ")
})

# Construct the tibble
specific_predictions_df <- tibble(
  Player = specific_players,
  Multinom_Predictions = labels[as.numeric(multinom_preds_for_players)],
  RF_Predictions = labels[as.numeric(rf_preds_for_players)],
  Multinom_Probabilities = multinom_probs_readable,
  RF_Probabilities = rf_probs_readable
)

print(specific_predictions_df)
## # A tibble: 4 × 5
##   Player              Multinom_Predictions RF_Predictions Multinom_Probabilities
##   <chr>               <chr>                <chr>          <chr>                 
## 1 Shai Gilgeous-Alex… Roster               Roster         Elite: 0.065 | All-St…
## 2 Zion Williamson     All-Star             Elite          Elite: 0.019 | All-St…
## 3 James Wiseman       Starter              Starter        Elite: 0 | All-Star: …
## 4 Josh Giddey         Starter              Roster         Elite: 0.025 | All-St…
## # ℹ 1 more variable: RF_Probabilities <chr>

6. Predictions for Players Drafted in 2019-2021

library(reactable)

my_labels <- c("Elite", "All-Star", "Starter", "Rotation", "Roster")
# Identify the rows for players drafted in 2019-2021:
draft_years <- c(2019, 2020, 2021)
filtered_row_numbers <- which(test_set$draftyear %in% draft_years)

# Extract predictions
drafted_multinom_preds <- multinom_prediction[filtered_row_numbers]
drafted_rf_preds <- rf_prediction[filtered_row_numbers]

drafted_multinom_probs <- multinom_prediction_prob[filtered_row_numbers, ]
drafted_rf_probs <- rf_prediction_prob[filtered_row_numbers, ]

# Convert probabilities to readable format
multinom_probs_readable <- apply(drafted_multinom_probs, 1, function(row) {
  paste(my_labels, round(row, 3), sep=": ", collapse=" | ")
})
rf_probs_readable <- apply(drafted_rf_probs, 1, function(row) {
  paste(my_labels, round(row, 3), sep=": ", collapse=" | ")
})


#  Construct the tibble
predictions_df <- tibble(
  Player = test_set$player[filtered_row_numbers],
  Multinom_Predictions = my_labels[as.numeric(drafted_multinom_preds)],
  RF_Predictions = my_labels[as.numeric(drafted_rf_preds)],
  Multinom_Probabilities = multinom_probs_readable,
  RF_Probabilities = rf_probs_readable
)
unique_predictions_df <- distinct(predictions_df, Player, .keep_all = TRUE)
reactable(unique_predictions_df)


Part 2 – Predicting Team Stats

In this section, we’re going to introduce a simple way to predict team offensive rebound percent in the next game and then discuss ways to improve those predictions.

Question 1

Using the rebounding_data dataset, we’ll predict a team’s next game’s offensive rebounding percent to be their average offensive rebounding percent in all prior games. On a single game level, offensive rebounding percent is the number of offensive rebounds divided by the number of offensive rebound “chances” (essentially the team’s missed shots). On a multi-game sample, it should be the total number of offensive rebounds divided by the total number of offensive rebound chances.

Please calculate what OKC’s predicted offensive rebound percent is for game 81 in the data. That is, use games 1-80 to predict game 81.

# Filter to OKC's games 1-80
okc_data <- rebounding_data %>%
  filter(team == "OKC", game_number <= 80)

# Predicted OREB% for game 81: total offensive rebounds over total chances
predicted_oreb_pct_for_game_81 <- sum(okc_data$offensive_rebounds) / sum(okc_data$off_rebound_chances)
print(predicted_oreb_pct_for_game_81)
## [1] 0.2886898
# Display the result
cat("Based on OKC's performance in games 1-80, the predicted offensive rebounding percentage for game 81 is approximately",
    round(predicted_oreb_pct_for_game_81 * 100, 1), "%", "\n")
## Based on OKC's performance in games 1-80, the predicted offensive rebounding percentage for game 81 is approximately 28.9 %

ANSWER 1:

28.9%

Question 2

There are a few limitations to the method we used above. For example, if a team has a great offensive rebounder who has played in most games this season but will be out due to an injury for the next game, we might reasonably predict a lower team offensive rebound percent for the next game.

Please discuss how you would think about changing our original model to better account for missing players. You do not have to write any code or implement any changes, and you can assume you have access to any reasonable data that isn’t provided in this project. Try to be clear and concise with your answer.

ANSWER 2:

When accounting for the absence of key players, particularly those who significantly impact a team’s performance, our model would need more sophistication. Here’s how we might approach the task:

1. Individual Player Metrics: Analyze rebounding statistics on a player-by-player basis, not just at the team level, to understand each player’s contribution to the team’s offensive rebound percentage.

2. Player Impact Factor: For each player, calculate an “impact factor” based on their contribution to the team’s offensive rebounding. This factor might combine their average offensive rebounds per game, the percentage of the team’s total offensive rebounds they are responsible for, and their overall presence on the court (e.g., minutes played).

3. Player Availability Data: We would need a dataset describing player availability for each game, including injuries, suspensions, or any other reason a player might miss a game.

4. Dynamic Prediction: Based on the availability of players for game 81, subtract the impact factor of the missing key players from the team’s overall offensive rebound percentage to predict a revised percentage for that game. If backups or replacements are known to be playing, their impact factor could be added back, though a backup might not perform at the same level in a starting role.

5. Historical Analysis: If a key player has missed games before, analyze the team’s offensive rebounding performance during those games to measure the real-world impact, and use that evidence to further adjust the prediction.

6. Team Strategy and Tactics: Teams may adjust their strategies based on the players available. Analyzing how strategies changed in past games when key players were absent lets us incorporate these tactical adjustments into our predictions.

By integrating these elements, our model will be better suited to predict the impact of missing players and provide a more accurate forecast for a team’s performance.
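As a rough illustration of the impact-factor adjustment described above, here is a hedged sketch in R. All numbers, the `player_stats` table, and the impact-factor formula itself are invented for illustration; a real version would estimate these from player-level data.

```r
library(dplyr)

# Hypothetical player-level stats (all numbers invented for illustration)
player_stats <- tibble(
  player             = c("A", "B", "C"),
  share_of_team_oreb = c(0.35, 0.15, 0.10),  # share of the team's offensive rebounds
  min_per_game       = c(32, 24, 18)
)

baseline_pred <- 0.289  # simple-average prediction from Question 1

# Crude impact factor: rebound share scaled by typical floor time
impact_factor <- function(stats, missing_player) {
  p <- filter(stats, player == missing_player)
  p$share_of_team_oreb * (p$min_per_game / 48)
}

# Revised prediction if player "A" is out (a simple linear discount)
adjusted_pred <- baseline_pred * (1 - impact_factor(player_stats, "A"))
round(adjusted_pred, 3)
```

The linear discount is deliberately simplistic; in practice, the replacement player's minutes and rebounding would partially offset the loss.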

Question 3

In question 2, you saw and discussed how to deal with one weakness of the model. For this question, please write about 1-3 other potential weaknesses of the simple average model you made in question 1 and discuss how you would deal with each of them. You may either explain a weakness and discuss how you’d fix that weakness, then move onto the next issue, or you can start by explaining multiple weaknesses with the original approach and discuss one overall modeling methodology you’d use that gets around most or all of them. Again, you do not need to write any code or implement any changes, and you can assume you have access to any reasonable data that isn’t provided in this project. Try to be clear and concise with your answer.

ANSWER 3:

  1. Lack of Time Sensitivity: The model assumes all games from 1 to 80 are equally important. However, team performance can evolve over the season due to player transfers, improving player skills, or deteriorating performances.

Solution: Implement a weighted average that gives more importance to recent games. For instance, games closer to the 81st game might be given a higher weight, on the assumption that recent performance is more indicative of the upcoming game.
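A sketch of such a recency weighting, using a simulated game log and a hypothetical 20-game half-life (both the data and the half-life are illustrative, not fitted):

```r
library(dplyr)

set.seed(1)
# Simulated game log standing in for one team's games 1-80
games <- tibble(
  game_number = 1:80,
  oreb        = rbinom(80, size = 12, prob = 0.3),  # offensive rebounds
  chances     = rep(40, 80)                          # offensive rebound chances
)

half_life <- 20  # hypothetical: a game 20 back counts half as much as the latest

weighted_pred <- games %>%
  mutate(weight = 0.5^((max(game_number) - game_number) / half_life)) %>%
  summarise(pred = sum(weight * oreb) / sum(weight * chances)) %>%
  pull(pred)
```

The half-life would be tuned by backtesting: predict each game from its predecessors and pick the value that minimizes out-of-sample error.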

  2. Variability and Outliers: Some games feature exceptional or disastrous performances that don’t reflect the team’s usual level. A simple average can be affected significantly by such outliers.

Solution: Apply statistical methods to detect and possibly remove or adjust for outliers. Alternatively, a median or a trimmed mean provides a more representative central value for the team’s performance, as it reduces the influence of extreme values.
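For instance, with a toy vector of per-game OREB% containing one very low and one very high outlier, `mean(x, trim = )` and `median()` give a more robust center than the plain mean:

```r
# Toy per-game OREB% values (invented for illustration), with two outliers
oreb_pct <- c(0.28, 0.31, 0.05, 0.30, 0.62, 0.27, 0.29)

mean(oreb_pct)               # pulled around by the outliers (~0.303)
mean(oreb_pct, trim = 0.15)  # drops the most extreme ~15% from each tail -> 0.29
median(oreb_pct)             # -> 0.29
```

Note that trimming per-game percentages is a simplification; with rebound counts available, a robust estimate of total makes over total chances would be preferable.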

  3. Ignoring Opponent Strength: Not all opponents are of equal strength, and a team’s rebounding rate can vary significantly based on its opponent. A simple average overlooks whether OKC faced mostly stronger or weaker rebounding teams over those 80 games.

Solution: Incorporate an opponent strength metric. If a team consistently rebounded well against strong opponents, that suggests genuine rebounding talent or strategy. We can adjust the predicted rebound percentage based on the strength of the opponent they’re about to face in game 81.
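One hedged sketch of such an adjustment regresses per-game OREB% on a simulated opponent-strength covariate; the variable names, effect sizes, and data are all invented for illustration:

```r
library(dplyr)

set.seed(42)
# Simulated season: stronger defensive-rebounding opponents depress OREB%
games <- tibble(
  game_number       = 1:80,
  opp_dreb_strength = rnorm(80),  # standardized opponent strength (hypothetical)
  oreb_pct          = 0.28 - 0.02 * opp_dreb_strength + rnorm(80, sd = 0.03)
)

fit <- lm(oreb_pct ~ opp_dreb_strength, data = games)

# Opponent-adjusted prediction for game 81 against an above-average
# rebounding opponent (strength one standard deviation above the mean)
predict(fit, newdata = tibble(opp_dreb_strength = 1))
```

The same `lm()` framework extends naturally to the other weaknesses: adding `game_number` (or recency weights via the `weights` argument) captures time trends, yielding the kind of regression model described in the overall methodology.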

Overall Modeling Methodology: To address the majority of these weaknesses, a regression model could be employed. By incorporating time as a variable, opponent strength, and other potential predictors, we can derive a more dynamic and responsive prediction for game 81. This approach provides a framework that can be iteratively refined by including more relevant variables or adjusting for nuances in the data.